Visualizing Data

The presentation of data in a pictorial or graphical format.


The most important but dangerous element of data analytics.

Data Visualization Tips

There are a few basic concepts that can help you generate the best visuals for displaying your data:

  • Understand your data.

  • Determine what you want to communicate.

  • Know your audience.

ggplot2

library(ggplot2)

# library(tidyverse)  # alternative: loads ggplot2 along with dplyr, tidyr, and others
  • Most robust and versatile
  • Based on the “Grammar of Graphics”
    • Plots are built up in layers

Plot Ingredients

  • Data
  • Mapping: maps variables to plot elements
  • Geoms (geometric objects): points, lines, boxes, histograms, bars, etc.
  • Scales: controls the mapping of the values in data space to values in aesthetic space
  • Guides: controls how visual properties are mapped back to the data space
    • Labels: axis, legend, titles
  • Themes: visual themes for the plot.
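
As a rough illustration of how these ingredients fit together, the sketch below labels each one in a single plot. It uses the built-in mtcars data set, which is not part of the original examples; the variables, colors, and label text are assumptions chosen only for illustration.

```r
library(ggplot2)

# Data: mtcars ships with R (an assumed stand-in for your own data)
p <- ggplot(data = mtcars,
            # Mapping: variables -> plot elements
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  # Geom: draw one point per row
  geom_point(size = 3) +
  # Scale: choose how the cyl groups map to colors
  scale_color_manual(values = c("4" = "steelblue", "6" = "goldenrod", "8" = "firebrick")) +
  # Labels: axis, legend, and title text
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders", title = "Fuel economy by weight") +
  # Theme: overall visual styling
  theme_bw()

p
```

Each ingredient is its own line joined by +, so any one of them can be changed or removed without touching the others.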

The Big 3

Only 3 ingredients are required to make a plot.

  1. Data
  2. Mapping / Aesthetics
  3. A “geom”
?ggplot
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())

1. Data

Always begin with the main function in ggplot2: ggplot

Data are specified via the “data” argument:

ggplot(data = mydata)

Called with only the data, ggplot() creates an empty plot (a blank coordinate system) to which layers are added.

2. Aesthetics

aes() maps variables from a data set to various elements of a plot

  • Discrete values (groups / categories) can have color, shape, linetype, or fill mappings.

  • Points can also have x and y position mappings.

Mappings go into the aes() function as the 2nd argument in ggplot().

ggplot(data=df, aes(x=V1, y=V2, color=V3))

Any part of the plot related to the data goes in aes()
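
A minimal sketch of the difference between mapping a variable inside aes() and setting a fixed value outside it. The mtcars data set and the color choice are assumptions for illustration, not part of the original examples.

```r
library(ggplot2)

# Mapped: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Set: color is a fixed value, so it goes outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```

The mapped version gets a legend because the color carries information about the data; the fixed version does not.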

3. Geoms

Geoms are the geometric objects drawn in your plot.

Common geoms include:

  • geom_boxplot()
  • geom_histogram()
  • geom_line()
  • geom_density()
  • geom_bar()
  • geom_point()
  • etc.

IMPORTANT

ggplot() is built in layers

Use the + operator to add layers to the existing ggplot() object.

In this way, your code is explicit about which layers are added and in what order.

ggplot(data=mydata, aes(x=V1, y=V2, color=V3)) + geom_point()

Have Data?

Variation in Design

To build your plots layer by layer, chain geoms together with +:

ggplot(mydata, aes(x, y)) + geom_point() + geom_line()

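PSA: the data can also be piped into ggplot(). The %>% pipe comes from magrittr (loaded by dplyr and the tidyverse); layers are still added with +: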

mydata %>% ggplot(aes(x, y)) + geom_point() + geom_line()

Adding layer by layer:

my_plot <- ggplot(df, aes(x, y))
my_plot <- my_plot + geom_point()
my_plot <- my_plot + geom_line()

Printing Plots

You do not need to create an object for the plot:

ggplot(data=df, aes(x=V1, y=V2, color=V3)) + geom_point()

BUT you can assign your plot to a variable…

my_plot <- ggplot(df, aes(x, y)) +  geom_point()

…and then print / view your plot

my_plot

Building Common Visualizations

Boxplots

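Boxplots visualize the distribution of a continuous variable through its five-number summary. Note that geom_boxplot() draws each whisker to the most extreme value within 1.5 × IQR of the box and plots anything beyond that as individual points, so the whisker ends match the minimum and maximum only when there are no such points. The five numbers are: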

  • Minimum
  • 25th percentile
  • Median (50th percentile)
  • 75th percentile
  • Maximum
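
To connect the five numbers to what is drawn, the sketch below computes them by hand and compares them with the values ggplot2 calculates for the box. It assumes the built-in faithful data set as example data; layer_data() is a ggplot2 helper that returns the computed values for a layer.

```r
library(ggplot2)

# Five-number summary of a built-in variable (assumed example data)
five <- quantile(faithful$waiting, probs = c(0, 0.25, 0.5, 0.75, 1))
five

# A single boxplot of the same variable
p_box <- ggplot(faithful, aes(x = "", y = waiting)) + geom_boxplot()

# Values ggplot2 computed for the box: whisker ends, hinges, and median
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats
```

This variable has no points beyond 1.5 × IQR of the box, so the whisker ends coincide with the minimum and maximum.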

Boxplots

One continuous variable and one discrete variable

# keep three populations from the howells data (assumed to be loaded already)
gol <- howells[howells$Population %in% c('ARIKARA', 'HAINAN', 'NORSE'), ]

ggplot(gol, aes(x=Population, y=GOL)) + geom_boxplot()

Boxplots: Discrete Colors

Discrete variables can also be used to differentiate plot elements by including them in the aes() function

ggplot(gol, aes(x=Population, y=GOL, color=Sex)) + geom_boxplot()

Boxplots: Discrete Fills

Using fill instead of color maps the discrete variable to the interior of each box rather than its outline

ggplot(gol, aes(x=Population, y=GOL, fill=Sex)) + geom_boxplot()

Histograms, Density Plots, and Bar Plots

One vector / column of your data

Histograms

  • Divides the range of scores into a specified number of “bins” on the x-axis and displays the count on the y-axis
    • Continuous data only
ggplot(faithful, aes(x=waiting)) +  geom_histogram()

Histograms: binwidth, fill, and color

ggplot(faithful, aes(x=waiting)) +  geom_histogram(binwidth=5,  fill="white", color="black")

Histograms: Faceting

facet_grid() separates (“facets”) the plots by rows, columns, or both.


General format: ROWS ~ COLUMNS


Note: a dot ( . ) means “do not facet on this dimension.”

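library(MASS)
data(birthwt)

# mytheme is the custom theme defined at the end of this document; create it before running this plot
ggplot(birthwt, aes(x=bwt)) +
    geom_histogram(fill="white", color="black") +
    facet_grid(smoke~.) +
    mytheme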
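
The ROWS ~ COLUMNS format can be sketched with three variations on one histogram. The mtcars data set and the bins value are assumptions used only for illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "white", color = "black")

p_rows <- base + facet_grid(cyl ~ .)   # one row per cyl value
p_cols <- base + facet_grid(. ~ cyl)   # one column per cyl value
p_both <- base + facet_grid(am ~ cyl)  # rows by am, columns by cyl

p_rows
p_cols
p_both
```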

Density Plots

  • “Nonparametric method for estimating the probability density function of a random variable”
  • In other words, the y-axis shows density (the area under the curve equals 1) instead of counts
  • Results in a smoothed line

ggplot(faithful, aes(x=waiting)) +  geom_density()

Density Plots: Sensitivity

ggplot(faithful, aes(x=waiting)) +  geom_density(adjust=0.25)
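
The adjust argument multiplies the default smoothing bandwidth, so values below 1 follow the data more closely and values above 1 smooth more heavily. A small sketch overlaying three settings; the specific adjust values and colors are arbitrary choices for illustration.

```r
library(ggplot2)

p_adjust <- ggplot(faithful, aes(x = waiting)) +
  geom_density(adjust = 0.25, color = "red") +   # quarter of the default bandwidth: wiggly
  geom_density(adjust = 1, color = "black") +    # default bandwidth
  geom_density(adjust = 4, color = "blue")       # four times the default: very smooth

p_adjust
```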

Density Plots: Fills

geom_density() can be filled with a color.


NOTE: alpha sets transparency, from 0 (fully transparent) to 1 (fully opaque).

ggplot(faithful, aes(x=waiting)) +  geom_density(fill="blue", alpha=0.5)

Density Plots: Factors

Just like other plots we have seen, a discrete variable (factor) can be used to separate groups

ggplot(birthwt, aes(x=bwt, fill=factor(smoke))) +
    geom_density(color=NA, alpha=0.5)

Adding Layers

# after_stat(density) puts the histogram on the density scale so it lines up with the density curve.
# By default, histograms show counts.

ggplot(faithful, aes(x=waiting, y=after_stat(density))) +
    geom_histogram(fill="white", color="black") +
    geom_density(fill="steelblue", alpha=0.4)

Bar Plots

Displays the counts of discrete/ordinal variables.

ggplot(diamonds, aes(x=cut)) + geom_bar()
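
To see where the bar heights come from, the sketch below compares geom_bar()'s internal counts with a hand-made table, then plots the already-summarized counts with geom_col(). The helper names (cut_counts, cut_df) are made up for this illustration.

```r
library(ggplot2)

# Counts per category, computed by hand
cut_counts <- table(diamonds$cut)
cut_counts

# geom_bar() computes the same counts internally
p_bar <- ggplot(diamonds, aes(x = cut)) + geom_bar()
bar_heights <- layer_data(p_bar)$count

# geom_col() plots values that are already summarized
cut_df <- as.data.frame(cut_counts)   # columns: Var1 (category), Freq (count)
p_col <- ggplot(cut_df, aes(x = Var1, y = Freq)) + geom_col()

p_bar
p_col
```

In short: reach for geom_bar() when you have one row per observation, and geom_col() when you already have one row per bar.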

Bar Plots: Bar Width

ggplot(diamonds, aes(x=cut)) + geom_bar()

ggplot(diamonds, aes(x=cut)) + geom_bar(width = 0.1)

ggplot(diamonds, aes(x=cut)) + geom_bar(width = 1)

Bar Plots: Factors

ggplot(diamonds, aes(x=cut, fill=color)) +  geom_bar()

ggplot(diamonds, aes(x=cut, fill=color))+ geom_bar(position="dodge")

Line Plots

Visualizes how the variable on the y-axis changes in relation to the variable on the x-axis

Can represent discrete (categorical) or continuous (numeric) variables on the x-axis

geom_line example

ggplot(BOD, aes(x=Time, y=demand)) + geom_col()

ggplot(BOD, aes(x=Time, y=demand)) + geom_line()

ggplot(BOD, aes(x=Time, y=demand)) + geom_line() + geom_point()
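
When the x-axis variable is discrete, geom_line() needs to be told which points belong to the same line. A minimal sketch, assuming we treat Time in BOD as a category by converting it to a factor:

```r
library(ggplot2)

# Treat Time as categorical; group = 1 tells geom_line() to connect all points
p_line <- ggplot(BOD, aes(x = factor(Time), y = demand, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "Time")

p_line
```

Without group = 1, each category is treated as its own group with a single observation, and no line is drawn.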

Change Point Characters

  • The pch argument changes the point type (pch is the base R name; ggplot2 also accepts shape). Likewise, lty is shorthand for linetype.
ggplot(BOD, aes(x=Time, y=demand)) + geom_line(lty=2) + geom_point(pch=7)

Group by Linetype

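# ToothGrowth has several observations per dose, so the raw lines zigzag;
# summarizing to group means first (as in the Size example) gives cleaner lines
ggplot(ToothGrowth, aes(x=dose, y=len, lty=supp)) + geom_line()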

Size

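Use size to adjust point size and lwd (linewidth) to adjust line thickness.

library(dplyr)  # provides %>%, group_by(), and summarize()

# mean tooth length for each dose-by-supplement group
tg <- ToothGrowth %>%
    group_by(dose, supp) %>%
    summarize(len = mean(len), .groups = "drop")

ggplot(tg, aes(x=dose, y=len, shape=supp, lty=supp)) + geom_line(lwd=2) + geom_point(size=6)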

Scatterplots

  • Bivariate scatterplots help you visualize relationships between two quantitative / continuous variables
  • When additional variables are being explored, you can use a scatterplot matrix
  • Helps identify outliers
  • Helps identify multicollinearity
  • Can include stat functions (linear and loess lines)
  • Can also incorporate boxplots / histograms / rug plots

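library(MASS)
library(dplyr)  # provides %>%; MASS also exports select(), so dplyr::select() is called explicitly
bw <- birthwt %>% dplyr::select(age, lwt, smoke, bwt)

# create labelled factors
unique(bw$smoke)
# [1] 0 1
bw$smoke <- factor(bw$smoke, levels=c(0,1), labels=c("No", "Yes"))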


# plot
ggplot(bw, aes(x=lwt, y=bwt)) + geom_point()

Scatterplots: Discrete Color

ggplot(bw, aes(x=lwt, y=bwt, color=smoke)) + geom_point()

Scatterplots: Discrete Shapes

ggplot(bw, aes(x=lwt, y=bwt, shape=smoke)) + geom_point(size=4)

Scatterplots: Specifying Shape

scale_shape_manual() specifies which point shape (pch code) each group gets

ggplot(bw, aes(x=lwt, y=bwt, shape=smoke)) + geom_point(size=4) +   scale_shape_manual(values=c(1,16))

Scatterplots: Adding a Stat Line

stat_smooth() is used to add a line which represents the statistical procedure specified by the “method” argument

ggplot(bw, aes(x=lwt, y=bwt)) + geom_point(size=4) + 
    stat_smooth(aes(color=smoke), method=lm, lwd=4)

ggplot(bw, aes(x=lwt, y=bwt)) + geom_point(size=4) + stat_smooth(aes(lty=smoke), method=lm, lwd=4, color="red")
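
As a check on what the line represents, the sketch below fits the same linear model with lm() and compares its predictions to the values stat_smooth() draws. It assumes the unmodified MASS::birthwt data and a single overall line (no grouping by smoke); the object names are made up for this illustration.

```r
library(ggplot2)

bw_raw <- MASS::birthwt   # accessed with :: so MASS is not attached

# Fit the same model directly
fit <- lm(bwt ~ lwt, data = bw_raw)
coef(fit)

p_lm <- ggplot(bw_raw, aes(x = lwt, y = bwt)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# Fitted values drawn by stat_smooth() (layer 2) versus predict() on the lm fit
line_data <- layer_data(p_lm, 2)
from_lm <- predict(fit, newdata = data.frame(lwt = line_data$x))

p_lm
```

The shaded band around the line is the confidence interval for the fit; se=FALSE turns it off.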

geom_smooth()

With its default settings, geom_smooth() adds a loess smoothing line to our plot (loess is the default for data sets with fewer than 1,000 observations)


Loess = Locally Estimated Scatterplot Smoothing

ggplot(bw, aes(x=lwt, y=bwt)) + geom_point(size=4) + 
    geom_smooth(se=FALSE, lwd=2, color="red")

Labels + Titles

# path on the author's machine; change it to where goldman_pc.csv is stored on yours
pc <- read.csv('/Users/christopherwolfe/Library/CloudStorage/GoogleDrive-chriswolfe93091@gmail.com/My Drive/ECU_Courses/Spring2025/ANTH_Stat/data/goldman_pc.csv')

library(dplyr)     # provides filter()
library(magrittr)  # provides %<>%, the assignment pipe

pc %<>% filter(Inst %in% c("DC", "KSU", "NM", "WOAC"))

example_plot <- ggplot(pc, aes(x=RRML, y=lhml, color=Inst)) +
    geom_point(size=3) +
    scale_color_manual(values=c("red","green","goldenrod","purple"))

example_plot <- example_plot +
    labs(x="Radius max. length", y="Humerus max. length", title="Scattered", color="Institution")

example_plot

Axes

example_plot + xlim(200, 300) + ylim(250, 350)
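
One thing worth knowing about xlim() and ylim(): they set scale limits, so observations outside the range are dropped rather than just hidden. The sketch below contrasts that with coord_cartesian(), which only zooms the view. The mtcars data set and the 2 to 4 range are assumptions for illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# xlim() sets scale limits: points outside 2-4 are turned into NA and dropped
p_limits <- base + xlim(2, 4)

# coord_cartesian() only zooms the view: all points are kept
p_zoom <- base + coord_cartesian(xlim = c(2, 4))

sum(is.na(layer_data(p_limits)$x))  # number of points removed by xlim()
sum(is.na(layer_data(p_zoom)$x))    # 0: nothing removed
```

The distinction matters most when a later layer computes something from the data, such as a smoothing line or a boxplot, because dropped points no longer contribute to it.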

Themes

  • Themes are used to adjust the visual appearance of elements in your plot.

  • element_blank()

  • element_line()

  • element_rect()

  • element_text()

  • element_blank() is used to remove a specified element from the plot

  • element_text() is the most used

example_plot + theme_bw() + theme(axis.title.x = element_text(colour="red", size=12))

element_text()

  • element_text() is used to change fonts, sizes, justification, angles, and more

  • help(element_text) for more options

    • TIP: Load ggplot2 and type help(theme) for all theme elements
theme(title=element_text(size=16))
theme(plot.title=element_text(size=22))
theme(axis.title=element_text(size=16))
theme(axis.text=element_text(size=12))
theme(legend.text=element_text(size=12))
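
Several of these settings can go into a single theme() call added to a plot. A minimal sketch, assuming the built-in mtcars data set and made-up title and legend text:

```r
library(ggplot2)

p_themed <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Theme demo", color = "Cylinders") +
  theme(
    plot.title  = element_text(size = 22),
    axis.title  = element_text(size = 16),
    axis.text   = element_text(size = 12),
    legend.text = element_text(size = 12)
  )

p_themed
```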

Pre-Made Themes

  • library(ggplot2) comes with several premade themes and library(ggthemes) includes many more… go explore!

  • Premade themes are functions and are called by the code: my_plot + theme_bw()


These themes are often good starting points for creating a custom theme

Premade Themes

Find your style!

mytheme <- theme_bw() +
    theme(panel.grid.major=element_blank(),
          panel.grid.minor=element_blank(),
          legend.background=element_blank(),
          legend.box.background=element_rect(color='black'),
          legend.key=element_blank(),
          legend.title=element_text(face='plain', size=14),
          legend.text=element_text(size=12),
          axis.title=element_text(size=15, lineheight=.9, vjust=.3),
          axis.text=element_text(size=12),
          axis.title.x=element_text(vjust=.2),
          axis.title.y=element_text(vjust=.3),
          plot.title=element_text(size=18),
          strip.text=element_text(size=12))